1  Introduction


This  is  the  final  report  on  our  work  under  ARO-SDI  Contract  DAAL03-87-K-0033,  April  1,  1987 
through  March  31,  1990. 

The  major  results  are  in  two  areas: 

1. Studies of systematic design procedures for a class of structured algorithms often encountered
in signal processing applications. These are what we have called Regular Iterative Algorithms
(RIAs), for which our results are summarized in Section 2.

It might be mentioned that these ideas have been successfully used by one of our former
students who helped to develop this theory, Dr. S. K. Rao of AT&T Bell Laboratories in
Holmdel, N.J. Dr. Rao has found the RIA results helpful in designing several fast integrated
circuit chips for communications and signal processing applications, some of which are being
used in the AT&T - ZENITH joint effort on High Definition Television (HDTV).

2. The other set of results deals with the issue of designing configurable and fault-tolerant
processor arrays such that, if some of the processors in the given array are faulty, a fault-free
array can be constructed comprising only the healthy processors. Such studies are easily
motivated in the case of Wafer Scale Integration (WSI) technology where, for example,
a large number of processors, configured in the form of a grid, can be put on a single wafer.
Due to yield problems, some of the processors are invariably going to be faulty. In such a
case, instead of treating the whole wafer as defective, one can work around the faulty
processors and reconfigure the rest in the form of a grid. Thus, reconfiguration methodologies
can be viewed as possible tools to increase the effective yield of the processing technology.
The general models that we have explored consist of a set of identical processors embedded
in a flexible interconnection structure that is configured in the form of a rectangular grid. In
particular, we studied models that use only limited hardware resources (such as a single track
or only a few tracks along every grid line) and developed the first known efficient algorithms
for reconfiguration in such models. In the process we have also developed new models that
use limited hardware and yet have higher reconfigurability than other models studied in the
literature. Our results on this topic are summarized in Section 3.


2  Summary  of  Our  Work  on  Regular  Iterative  Algorithms 

Our previous work has shown (see e.g., [9, 10, 26, 27, 28, 29, 37]) that once a Regular
Iterative Algorithm is designed for a given problem, one can use the systematic design theory
developed by us to generate efficient processor arrays. However, the following two important issues
were left unresolved in the general area of designing special purpose processor arrays:

1. Systematic procedures for scheduling and mapping any given RIA were not fully developed.
Our previous work had answered this issue only partially, and some theoretical gaps remained
in the framework.

2. Most algorithms are not given to the designer in the RIA form; most initial representations
are either sequential in nature (e.g., FORTRAN or PASCAL programs) or general mathematical
expressions. Hence, one needs to develop systematic procedures for deriving RIAs.

In our work we have resolved both of the above issues. In particular, we have developed a general
framework for scheduling and mapping any given RIA; we have also developed a formal methodology
for the systematic formulation of RIAs starting from representations that we refer to as linearly
indexed Assignment Codes. It can be shown that such codes are very close to the mathematical
expressions of a wide variety of problems, especially in signal processing and matrix algebra.

In  this  section,  we  shall  first  briefly  introduce  RIAs  and  summarize  our  contributions  in  the 
analysis and implementation of such algorithms. We shall then briefly summarize our formal
methodologies for scheduling a general RIA and also for deriving RIAs starting from general
representations such as mathematical formulas.

2.1  Regular  Iterative  Algorithms  and  Our  Contributions 

A formal definition of RIAs can be found in [13, 29, 37]; here we shall introduce RIAs via a simple
example.

Example 2.1 (2-D Filtering Algorithm): It can be shown (see [26, 29]) that certain
numerically stable 2-D filtering algorithms due to Deprettere and Dewilde [5], Vaidyanathan and
Mitra [43], and Fettweis [6], can all be written in the form:

For all $(i,j,k)$, where $0 \le i \le n$ and $0 \le j,k \le N$, do

$x(i, j+1, k+1) = f_{x,i}(x(i,j,k), y(i,j,k), w(i,j,k))$

$y(i+1, j, k) = f_{y,i}(x(i,j,k), y(i,j,k), w(i,j,k))$

$w(i-1, j, k) = f_{w,i}(x(i,j,k), w(i,j,k))$

where $f_{x,i}$, $f_{y,i}$, $f_{w,i}$ are linear functions that are determined by a synthesis procedure.

□

The  example  displays  the  following  (characteristic)  features  of  an  RIA: 

Each variable in the RIA is identified by a label (e.g., x, y or w in Example 2.1) and an index
vector (e.g., $I = [i\ j\ k]^T$ in Example 2.1). The set of all index points over which the variables
of the RIA are defined is called the index space, which is a subset of the S-dimensional
integer lattice, $Z^S$.

The dependences among the variables are regular with respect to the index points. That is,
if $x_1(I)$ is computed using the value of $x_2(I - d_{12})$, then the index displacement vector $d_{12}$,
corresponding to this direct dependence, is the same regardless of the index point $I$.

The  set  of  computations  performed  at  every  index  point  is  often  referred  to  as  the  iteration  unit 
of  the  RIA.  Also,  note  that  although  the  direct  dependences  among  the  variables  in  an  RIA  are 
required  to  be  independent  of  the  index  points,  the  actual  computations  carried  out  to  evaluate 
these  variables  can  depend  on  the  index  point.  In  general,  the  index  space  I  will  be  semi-infinite 
along  certain  coordinates  and  bounded  along  others.  The  bounds  on  the  coordinates  will  be  referred 
to  as  the  size  parameters  of  the  RIA. 

The  regular  dependences  of  an  RIA  lead  to  a  dependence  graph  with  an  iterative  structure, 
which  can  be  clearly  demonstrated  by  embedding  the  dependence  graph  within  the  index  space. 
That is, a set of V nodes is defined at every index point $I$ in the index space $\mathcal{I}$, where the ith node
represents the variable $x_i(I)$ in the RIA. As first noted by Karp et al. [13] and by Waite [45], the
regularity of the dependence graph of an RIA can be concisely expressed in terms of a simpler and
smaller graph called the Reduced Dependence Graph (RDG). The RDG of an RIA (see Fig. 1) has
one node for each of the indexed variables in the RIA; it has a directed arc from node $x_i$ to node $x_j$
if $x_j(I)$ is computed using the value of $x_i(I - d_{ji})$ for some $d_{ji}$; finally, each directed arc is assigned
a vector weight representing the displacement of the index point across the direct dependence. We
should note that the RDG and a specification of the index space $\mathcal{I}$ completely characterize the
dependence graph of an RIA; hence, the analysis of parallelism in an RIA is based on the analysis
of the RDG instead of the larger dependence graph.
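The way a small RDG stands in for a much larger dependence graph can be illustrated with a short Python sketch. The arc weights are read off Example 2.1; the weight (0, 1, 1) on the arcs into x is assumed here for illustration, and the names `RDG` and `expand`, as well as the small bounds, are hypothetical.

```python
from itertools import product

# Reduced Dependence Graph: an arc (src, dst, d) means dst(I) uses src(I - d).
# Weights follow Example 2.1; the weight (0, 1, 1) on arcs into x is an assumption.
RDG = [
    ("x", "x", (0, 1, 1)), ("y", "x", (0, 1, 1)), ("w", "x", (0, 1, 1)),
    ("x", "y", (1, 0, 0)), ("y", "y", (1, 0, 0)), ("w", "y", (1, 0, 0)),
    ("x", "w", (-1, 0, 0)), ("w", "w", (-1, 0, 0)),
]

def expand(rdg, bounds):
    """Embed the RDG at every index point of a bounded index space.

    bounds holds inclusive upper limits; every coordinate starts at 0.
    Returns the edges ((src, I - d), (dst, I)) whose endpoints both lie inside.
    """
    points = set(product(*(range(b + 1) for b in bounds)))
    edges = []
    for point in sorted(points):
        for src, dst, d in rdg:
            source_point = tuple(p - q for p, q in zip(point, d))
            if source_point in points:
                edges.append(((src, source_point), (dst, point)))
    return edges

edges = expand(RDG, bounds=(2, 1, 1))
print(len(RDG), "RDG arcs describe", len(edges), "dependence-graph edges")
```

The RDG stays the same size however large the bounds are, while the expanded graph grows with the size parameters, which is why the analysis is carried out on the RDG.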

Some of our important results are enumerated below; for a detailed account of the work reported
here and of previous work, please see [26, 37].

1. A formal definition of systolic arrays was obtained that captured their generally accepted
properties, especially regularity (mostly identical processors), spatial locality (local
interconnections), temporal locality (no delay-free operations, or more precisely, all combinational
elements are latched) and pipelined operation (throughput independent of the order, suitably
defined, of the system). Some authors (e.g., Leiserson et al. [19]) had used only a subset
of these properties, but the consensus in the literature appeared to require all those
mentioned above (see e.g., [28] and [16]).

Figure 1: The RDG of the RIA in Example 2.1.

2. A reasonable generalization of the concept of systolic arrays that allowed implementation of a
larger class of algorithms (including of course all systolic algorithms) was also developed. The
generalization allowed the presence of register pipelines of various lengths at different points in
a regular array of (mostly) identical processors, and sometimes also some LIFO (Last-In-First-Out)
buffers. Such architectures have almost all the advantages that make systolic arrays so
appealing for VLSI; the only added requirement is that some of the processors may require
a certain amount of memory. We should note here that the memory requirement is
not a major bottleneck, and certain commercial products, such as the WARP developed at
CMU, routinely provide such on-processor memory.

Rao et al. called such arrays Regular Iterative Arrays, and algorithms implementable on such
arrays were dubbed Regular Iterative Algorithms. It is convenient to use the acronym RIA
to stand for either of these concepts, the exact one to be inferred from the context. Using
the above concepts and their consequences, one can show, for example, that there are Regular
Iterative Algorithms (e.g., RIAs for certain classes of 2-D filtering algorithms, RIAs for
certain pivoting algorithms [37, 31, 34], etc.) that cannot be implemented on systolic arrays,
as formally defined, but can be implemented on regular iterative arrays.

3. It was also shown [10, 16, 26, 37] that many algorithms in digital filtering (convolution,
correlation, autoregressive, and moving-average filtering), numerical linear algebra, discrete
methods for PDEs and ODEs, and graph theory (transitive closure, some coloring problems)
can be reformulated as RIAs. Systematic procedures for converting algorithms into RIAs,
however, remained an open problem.

4. For any RIA, formal methods to determine lower bounds on I/O latency and memory
requirements were developed; systematic procedures for implementing most RIAs on regular
processor arrays that can achieve the lower bound on I/O latency were also proposed (see
[26, 27, 29, 37]). We should mention here that these formal mapping techniques can generate
several possible architectures, though in practice one stops once a few efficient (i.e., meeting
certain performance lower bounds) arrays have been obtained.

The theoretical question of scheduling and mapping any given RIA was left as an open
problem. Our recent results on this topic are summarized in Section 2.2.

5. In the design of systolic arrays, several issues such as systematic procedures for designing
multi-rate systolic arrays were resolved. In conventional systolic array designs all
operations were assumed to take the same amount of time; this led to unrealistic and inefficient
designs. Our design procedure allows one to carry out the design with more realistic processor
modules that can increase the throughput by exploiting the fact that the time required to
carry out different operations is generally different.

2.2  Scheduling  and  Implementing  A  Given  RIA 

Herein we describe our novel subspace scheduling scheme that can be used to schedule and implement
any given RIA. Based on the analysis of the RDG, Karp et al. [13] showed that for a certain subclass
of RIAs, one can always determine a scheduling vector $\lambda$ and scalars $\gamma_{x_i}$ such that a variable $x_i(I)$
can be scheduled at step $\lambda^T I + \gamma_{x_i}$. Such schedules will be called uniform affine schedules. Thus,
two variables $x_i(I)$ and $x_i(J)$ are assigned the same schedule if $\lambda^T (I - J) = 0$. Equivalently, two
variables $x_i(I)$ and $x_i(J)$ are assigned the same schedule if $I$ and $J$ lie on the same hyperplane
defined by the normal vector $\lambda$. Therefore, for such RIAs one can draw isotemporal hyperplanes
(see Fig. 2) in the index space, and because of this geometric interpretation, these schedules are
often referred to as hyperplanar schedules. We should note here that Rao et al. [26], [27] showed
that algorithms implementable on systolic arrays are exactly those RIAs that admit uniform affine
schedules.
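As a minimal sketch of what a uniform affine schedule must satisfy, assume every computation takes one time step. If a variable is computed from another variable displaced by d, the consumer must be scheduled at least one step after the producer, which gives the test "lambda . d + gamma(consumer) - gamma(producer) >= 1" on every RDG arc. The RIA below (a localized matrix product) and the names `is_valid_affine_schedule` and `step` are hypothetical and serve only as illustration.

```python
def is_valid_affine_schedule(arcs, lam, gamma):
    """Check lam . d + gamma[dst] - gamma[src] >= 1 on every RDG arc.

    An arc (src, dst, d) means dst(I) is computed from src(I - d); every
    computation is assumed to take one time step.
    """
    return all(
        sum(l * di for l, di in zip(lam, d)) + gamma[dst] - gamma[src] >= 1
        for src, dst, d in arcs
    )

def step(lam, gamma, var, point):
    """Time step lam^T I + gamma_var assigned to the variable var(I)."""
    return sum(l * p for l, p in zip(lam, point)) + gamma[var]

# Hypothetical RIA (a localized matrix product), used only for illustration:
#   a(I) = a(I - (0,1,0)),  b(I) = b(I - (1,0,0)),
#   c(I) = c(I - (0,0,1)) + a(I) * b(I)
arcs = [
    ("a", "a", (0, 1, 0)),
    ("b", "b", (1, 0, 0)),
    ("c", "c", (0, 0, 1)),
    ("a", "c", (0, 0, 0)),
    ("b", "c", (0, 0, 0)),
]
lam = (1, 1, 1)
gamma = {"a": 0, "b": 0, "c": 1}

print(is_valid_affine_schedule(arcs, lam, gamma))
# Two index points with lam^T (I - J) = 0 lie on the same isotemporal hyperplane.
print(step(lam, gamma, "c", (2, 0, 1)) == step(lam, gamma, "c", (1, 1, 1)))
```

The scalar offsets matter here: with all offsets equal to zero, the zero-displacement arcs from a and b into c would violate the precedence test.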

Hyperplanar schedules, however, do not exist for all RIAs; one can show (see e.g., [26], [29])
that the RIA in Example 2.1 cannot have a hyperplanar schedule. Nonetheless, it turns out that
the scheduling properties of an RIA can be determined from the information gathered during the
execution of a so-called computability analysis procedure. The aim of the computability analysis
is to determine whether the given RIA is well-defined. A well-defined RIA should not have any
directed cycles (i.e., the execution of a computation should not depend on its own output) or any
directed paths coming from infinity (i.e., every computation can be completed in a finite number
of steps) in its dependence graph. The computability analysis leads to an iterative decomposition
of the RDG, which results in a tree structure called the computability tree. It is shown (see [33],
[13], [26]) that an RIA cannot admit hyperplanar schedules if the depth of its computability tree,
$l$, is greater than 1.
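The claim for Example 2.1 can be checked by hand, under the usual assumption that every computation takes at least one time step, so that a variable computed from itself at displacement $d$ requires $\lambda^T d \ge 1$ (the scalar offsets cancel on a self-dependence). The two self-dependences of y and w read off the example give:

```latex
\begin{align*}
y(I) \text{ uses } y(I - e_1) &\;\Rightarrow\; \lambda^T e_1 \ge 1, \\
w(I) \text{ uses } w(I + e_1) &\;\Rightarrow\; -\lambda^T e_1 \ge 1,
\end{align*}
\qquad e_1 = [1\ 0\ 0]^T .
```

Adding the two inequalities gives $0 \ge 2$, so no single scheduling vector $\lambda$ can serve both variables.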
Figure 2: Isotemporal hyperplanes in the index space of an RIA that admits uniform affine schedules.

In our work [37, 32], we have shown that hyperplanar schedules, valid for only a restricted
subclass of RIAs, can be extended and that asymptotically optimal subspace schedules can be
determined to schedule any RIA defined over a bounded or a semi-infinite index space. In particular,
we show that if the depth of the computability tree is $l$, and the dimension of the index space is
$S$, then for every indexed variable $x_i$, one can determine a linear subspace $L_i$ of dimension $S - l$
such that two variables $x_i(I)$ and $x_i(J)$ can be scheduled at the same step if $(I - J) \in L_i$. Thus,
the proposed scheduling scheme defines isotemporal surfaces that are linear subspaces of dimension
$S - l$, instead of dimension $S - 1$ (which is the case for hyperplanar schedules). Fig. 3 shows the
isotemporal lines in the index space of the RIA for 2-D filtering presented in Example 2.1. We should
mention here that we have also devised alternative algebraic techniques for scheduling RIAs defined
over bounded index spaces. These algebraic scheduling techniques, however, do not always have
the same geometrical interpretation as the subspace scheduling schemes and are not valid for RIAs
defined over semi-infinite index spaces (note that Karp et al. mostly studied RIAs defined over
semi-infinite index spaces only).

Figure 3: Isotemporal lines for variables y and w in the index space of the RIA in Example 2.1.

We also show that the subspace scheduling scheme described above can be used to characterize
the extent of parallelism in RIAs (i.e., whether the given RIA has unbounded parallelism or not).
This was attempted by Karp et al. [13] in their seminal paper; however, the general result that we
prove here was left as a conjecture. In particular, we show that if the computability tree is of depth
$l < S$ (where $S$ is the dimension of the index space) then the RIA always has unbounded parallelism, and
if $l = S$ then it has bounded parallelism except possibly at the boundaries of the index space.

The next issue can be motivated by observing that for RIAs admitting hyperplanar schedules,
one can write the schedule of a variable $x_i(I)$ explicitly as $\lambda^T I + \gamma_{x_i}$. Rao et al. [26], [29], [27],
[10] extended this result and obtained explicit schedules for all RIAs defined over bounded index
spaces; however, explicit schedules for RIAs defined over semi-infinite index spaces had not been
developed. The explicit scheduling functions are still affine, i.e., the schedule for a variable $x_i(I)$
can be written as $\lambda_i^T I + \gamma_{x_i}$; however, the vector $\lambda_i$ may be different for different variables, and
its entries, along with $\gamma_{x_i}$, will be functions of the size parameters (i.e., the bounds) of the RIA.
For RIAs that are defined over semi-infinite index spaces, we show that we can identify a precisely
defined subclass such that the explicit schedule of a variable $x_i(I)$ can be expressed as a polynomial
in the indices of the index point $I$. For RIAs not belonging to this subclass, we show that one can
always come up with examples such that the schedule of $x_i(I)$ is exponential or double exponential
in the indices of $I$. However, we are able to refine our analysis further and provide a necessary and
sufficient condition for the schedule of a variable $x_i(I)$ to be polynomially bounded.

The results presented so far, on the analysis of parallelism in RIAs, are independent of
implementation on any particular machine and can be used for executing RIAs in parallel on any
architecture. Previous work by several researchers beginning with [21], however, has shown that
RIAs are particularly well suited for implementation on regular mesh-connected processor arrays.
The proposed technique is to define a linear subspace, called the iteration space, such that
computations corresponding to variables $x_i(I)$ and $x_i(J)$ are assigned to the same processor if the vector
$I - J$ belongs to the iteration space. A geometric interpretation of the mapping technique can
be obtained by decomposing the index space into parallel subspaces such that variables belonging
to the same subspace are assigned to the same processor. Fig. 4 shows the iteration space and a
resultant processor array for implementing the RIA in Example 2.1. It can be shown that, because
of the regular structure of dependence graphs of RIAs, resultant processor arrays are regular with
fixed and local interconnections. This linear projection scheme for partitioning computations is
only the first step towards parallel implementation, and one has to make sure that the partitioning
scheme is compatible with scheduling techniques, i.e., two computations scheduled at the same
step should not be assigned to the same processor.
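The projection idea can be sketched in a few lines of Python for the simplest case, a one-dimensional iteration space combined with a hyperplanar schedule. The vector `u`, the matrix `P`, the schedule `lam`, and the 3 x 3 x 3 index space are all hypothetical choices made for illustration; they are not taken from Fig. 4.

```python
from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def allocate(point, proj_rows):
    """Processor label of an index point: its image under the projection matrix."""
    return tuple(dot(row, point) for row in proj_rows)

# Hypothetical 3-D example: the iteration space is the line spanned by u, and
# the rows of P span its orthogonal complement, so P u = 0.
u = (0, 0, 1)
P = [(1, 0, 0), (0, 1, 0)]
lam = (1, 1, 1)                      # assumed hyperplanar schedule, step = lam . I

assert all(dot(row, u) == 0 for row in P)

slots = {}
for point in product(range(3), repeat=3):
    key = (allocate(point, P), dot(lam, point))   # (processor, time step)
    slots.setdefault(key, []).append(point)

conflicts = [pts for pts in slots.values() if len(pts) > 1]
print("processors:", len({proc for proc, _ in slots}), "conflicts:", len(conflicts))
```

Index points that differ by a multiple of `u` share a processor, and since `lam . u` is non-zero they are never scheduled at the same step, which is the compatibility requirement stated above.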

Figure 4: An iteration space and a corresponding processor array for the RIA in Example 2.1.

We have shown [26], [33], [29] that given an RIA that is scheduled by a subspace scheduling
scheme as described before, one can always determine compatible iteration spaces. Recall that in
our subspace scheduling, $x_i(I)$ and $x_i(J)$ are assigned the same schedule if $I - J$ belongs to an
isotemporal subspace; note that isotemporal subspaces may be different for different variables $x_i$.
Thus, for an iteration space to be compatible with a subspace schedule, one should satisfy the
following condition: if $I - J$ ($I \ne J$) is in the iteration space then it should not be in any of the
isotemporal subspaces and vice-versa. In other words, an iteration space will be compatible with a
subspace schedule if and only if it has only zero intersection with the isotemporal subspace of each
variable $x_i$ in the RIA. Therefore, the procedure for designing processor arrays can be reduced to
the linear algebraic problem of determining a subspace that has no non-zero intersection with a
finite number of given subspaces.
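The compatibility condition lends itself to a direct rank test: two subspaces U and W meet only at the origin exactly when the rank of their combined spanning sets equals rank(U) + rank(W). The sketch below uses exact rational arithmetic; the function names and the two isotemporal lines are hypothetical, and it only tests a candidate iteration space rather than constructing one.

```python
from fractions import Fraction

def rank(vectors):
    """Rank of a list of integer vectors, by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    r = 0
    for col in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def is_compatible(iteration_basis, isotemporal_bases):
    """True if the iteration space meets every isotemporal subspace only at 0."""
    ru = rank(iteration_basis)
    return all(
        rank(iteration_basis + basis) == ru + rank(basis)
        for basis in isotemporal_bases
    )

# Hypothetical isotemporal lines for two variables in a 3-D index space.
isotemporal = [[(0, 1, -1)], [(1, 0, 0)]]
print(is_compatible([(0, 0, 1)], isotemporal))   # line along the third axis
print(is_compatible([(1, 0, 0)], isotemporal))   # coincides with the second line
```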

We  can  now  briefly  summarize  our  new  results  on  the  analysis  and  implementation  of  RIAs  as 
follows  (these  results  are  based  on  the  novel  subspace  scheduling  scheme  that  we  have  developed): 

1.  We  have  developed  a  scheme  to  schedule  any  given  RIA  defined  over  a  bounded  or  semi-infinite 
index  space.  This  generalizes  the  so-called  hyperplanar  schedules. 

2. We have developed procedures to determine explicit schedules, i.e., formulas in which the
schedule for a variable $x_i(I)$ is expressed as a function of $i$, $I$ and possibly also the size parameters
of the RIA. We have also analyzed the extent of parallelism in RIAs; this analysis leads to a
proof of a conjecture made in the seminal paper of Karp et al. [13].

3.  We  have  shown  that  for  any  given  RIA  scheduled  according  to  our  subspace  scheduling  scheme, 
there  always  exist  compatible  mesh-connected  processor  arrays  for  implementing  it  in  parallel. 

2.3  Systematic  Formulation  of  RIAs 

Let us first introduce the concept of localized algorithms that are close to RIAs (see e.g., [37, 36, 35,
12, 33]). The definition of the localized algorithms is motivated by the observation that there are
certain problems that can be solved by algorithms that have regular dependence graphs that are not
completely homogeneous. That is, the dependence graphs may have dependences or computations
that are present only in certain portions of the dependence graphs. As discussed in [37], one
way of handling such cases is to assume that the dependences and the computations are present
everywhere in the index space and then to apply the results for RIAs. There are several problems
where  this  approach  is  reasonable;  for  example  the  Gaussian  elimination  algorithm  without  pivoting 
can  be  first  written  in  the  localized  algorithm  form  and  then  can  be  implemented  on  processor  arrays 
by  modeling  the  localized  algorithm  as  an  RIA.  The  other  approach  is  to  break  up  the  dependence 
graph  into  more  than  one  component  such  that  the  dependence  graph  is  homogeneous  over  each 
component.  The  mapping  techniques  can  then  be  applied  to  each  such  component  with  special 
consideration  to  the  dependences  at  the  boundaries  between  the  components.  The  latter  approach 
is  discussed  in  more  detail  in  [37]  where  the  example  of  Gauss-Jordan  elimination  algorithm  is 
worked  out. 

The localized algorithms have statements of the form

$x_i(I) = f_{x_i}(x_1(I - d_{i1}), \ldots, x_V(I - d_{iV})) \quad \forall I \in \mathcal{I}_i. \qquad (1)$

Thus each statement in this algorithm may have a different index space of its own; as a comparison,
all statements in an RIA have the same index space.

Partial  attempts  have  been  made  by  several  authors,  including  [14],  [22],  [15],  [25],  [4]  and  [7],  to 
formalize  the  conversion  procedure  for  going  from  an  initial  representation  to  an  RIA  or  a  localized 
algorithm.  The  first  step  always  is  to  convert  algorithms  into  equivalent  Single  Assignment  Codes 
(SACs)  and  the  second  step  tries  to  localize  the  dependences  by  eliminating  broadcasts.  Single 
assignment  codes  [2]  are  representations  where  every  variable  defined  in  the  algorithm  takes  on  a 
unique  value  during  the  course  of  execution.  The  fact  that  the  dependence  graph  of  an  algorithm  can 
be  easily  determined  from  its  SAC,  has  made  SACs  a  very  useful  starting  representation  for  parallel 
implementations  of  algorithms.  Considering  its  importance,  a  lot  of  work  has  been  done  in  trying  to 
convert  sequential  algorithms  into  SACs,  see  e.g.,  [24].  However,  sequential  algorithms  are  not  the 
only  representations  from  which  SACs  can  be  derived.  Often  SACs  can  be  derived  systematically 
from  given  mathematical  expressions.  Consider  a  mathematical  expression  for  matrix  multiplication 


For all tuples $(i, j)$, $1 \le i, j \le n$, do

$c_{ij} := \sum_{k} a_{ik} \cdot b_{kj}$, the sum taken over all $1 \le k \le n$ (2)

and a SAC

For all triples $(i, j, k)$, $1 \le i, j, k \le n$, do

$c(i, j, k+1) := c(i, j, k) + a_{ik} \cdot b_{kj}$ (3)

In  the  mathematical  expression,  the  ordering  of  operations  in  the  inner  product  is  not  specified 
and  in  fact  it  can  be  arbitrary  because  of  the  commutativity  and  associativity  of  the  operation  +. 
However,  in  the  given  SAC  the  ordering  is  fixed  and  a  degree  of  freedom  has  been  lost.  Since  the 
original  representation  has  more  freedom  and  potential  parallelism  in  it,  it  would  be  desirable  to 
make  it  the  starting  representation  and  then  systematically  derive  one  or  more  SACs  from  it.  It 
turns  out  that  a  number  of  algorithms  can  be  written  in  the  form  of  (2)  (see  [37,  36,  35]),  and 
we  shall  refer  to  such  representations  as  Assignment  Codes  (ACs).  The  prefix  ‘Single’  has  been 
intentionally  dropped  to  emphasize  the  fact  that  in  such  representations  the  number  of  inputs  for 
computing  a  variable  may  depend  on  the  problem  size,  as  opposed  to  a  conventional  SAC  where  the 
number  of  inputs  to  every  variable  is  restricted  to  be  some  constant,  independent  of  the  problem 
size.  From  now  on  we  shall  refer  to  the  number  of  inputs  to  a  variable  as  its  in-degree  and  the 
number  of  variables  that  a  particular  variable  is  input  to  as  its  out-degree.  If  the  in-  or  out-degree 
of  a  variable  depends  on  the  problem  size,  then  we  shall  define  it  to  be  unbounded.  Thus,  the 
variables  in  ACs  can  have  unbounded  in-  and  out-degrees,  whereas  in  SACs  the  variables  have 
bounded  (i.e.,  constant)  in-degrees  but  may  have  unbounded  out-degrees. 
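The difference between the two representations can be made concrete with a small Python sketch. It is only an illustration of forms (2) and (3), using 0-based indices and integer data; the function names and the shuffled summation order are hypothetical devices, chosen to stress that (2) leaves the order of accumulation open while (3) fixes it.

```python
import random

def matmul_ac(a, b, seed=0):
    """Assignment code (2): c_ij is a sum over all k, in no prescribed order."""
    n = len(a)
    rng = random.Random(seed)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            ks = list(range(n))
            rng.shuffle(ks)          # any order is legal: + is commutative and associative
            c[i][j] = sum(a[i][k] * b[k][j] for k in ks)
    return c

def matmul_sac(a, b):
    """Single assignment code (3): every c(i, j, k) is written exactly once."""
    n = len(a)
    c = {}
    for i in range(n):
        for j in range(n):
            c[i, j, 0] = 0           # initial value, one entry per (i, j)
            for k in range(n):       # the order of accumulation is now fixed
                c[i, j, k + 1] = c[i, j, k] + a[i][k] * b[k][j]
    return [[c[i, j, n] for j in range(n)] for i in range(n)]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_ac(a, b) == matmul_sac(a, b))
```

In the AC version each entry of c takes n inputs from a and n from b at once, so its in-degree grows with n; in the SAC version each c(i, j, k+1) has a constant number of inputs.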
We shall further restrict ourselves to linearly indexed ACs, which can be shown to be very close
to mathematical expressions for a number of problems, especially in signal processing and matrix
algebra. A linearly indexed AC has statements of the form

$x(PI + d)$ depends on $y(QI + e)$ for all $I \in \mathcal{I} \subset Z^S$ (4)

where $P$ and $Q$ are integral matrices independent of $I$, $\mathcal{I}$ is an index space, which is the set of all
lattice points enclosed within a specified region in an S-dimensional Euclidean space, and $d$, $e$ are
constant displacement vectors. $P$ and $Q$ are often referred to as the indexing matrices. We have
shown in [37, 36, 35] that in- and out-degrees of variables x and y are completely determined by the
structure and dimension of the right null-space of each of the indexing matrices. Many algorithms
are actually directly available as (4), and examples include the formulas for matrix multiplication,
any m-dimensional convolution/correlation, matrix transposition, and solving the matrix Lyapunov
equation. Algorithms that are not directly in the form of (4) can often be easily put in that form
by analyzing their sequential representations (see [37]).

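Example 3.1: The formula for matrix multiplication is:

For all tuples $(i, j)$, $1 \le i, j \le n$, do

$c_{ij} := \sum_{k} a_{ik} \cdot b_{kj}$, the sum taken over all $1 \le k \le n$

The index space of the example is $\mathcal{I} = \{(i,j,k) \mid 1 \le i,j,k \le n\}$. There is one functional relation
in the given AC with the dependence matrices

$P_c = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \quad Q_{ac} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad Q_{bc} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

and the displacement vectors $d, e = 0$. □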
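The link between the null spaces of the indexing matrices and the in- and out-degrees can be checked by brute-force counting. The sketch below is a hedged illustration, not the analysis of [37, 36, 35]: it enumerates the index space of Example 3.1 and counts how many index points map to the same entry of c (inputs from a per entry of c) and to the same entry of a (entries of c fed by one entry of a). The helper names are hypothetical.

```python
from collections import Counter
from itertools import product

def times(matrix, point):
    """Integer matrix-vector product, returned as a tuple."""
    return tuple(sum(m * p for m, p in zip(row, point)) for row in matrix)

P_c  = [(1, 0, 0), (0, 1, 0)]   # c is indexed by (i, j)
Q_ac = [(1, 0, 0), (0, 0, 1)]   # a is indexed by (i, k)
Q_bc = [(0, 1, 0), (0, 0, 1)]   # b is indexed by (j, k); not needed for the a-counts

def degrees(n):
    """Count dependences of c(P_c I) on a(Q_ac I) over 1 <= i, j, k <= n."""
    in_deg, out_deg = Counter(), Counter()
    for point in product(range(1, n + 1), repeat=3):
        in_deg[times(P_c, point)] += 1     # one more a-input to this c entry
        out_deg[times(Q_ac, point)] += 1   # one more c entry fed by this a entry
    return in_deg, out_deg

for n in (2, 4):
    in_deg, out_deg = degrees(n)
    print(n, set(in_deg.values()), set(out_deg.values()))
```

Both counts equal n because $P_c$ and $Q_{ac}$ each have a one-dimensional null space (spanned by $[0\ 0\ 1]^T$ and $[0\ 1\ 0]^T$ respectively), so the degrees grow with the problem size and are unbounded in the sense defined above.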

We have shown that a linearly indexed AC can be systematically decomposed into a linearly indexed
SAC and a linearly indexed dual SAC. Linearly indexed dual SACs may have variables with
unbounded in-degrees but bounded out-degrees, as opposed to the linearly indexed SACs, which have
variables with bounded in-degrees but possibly unbounded out-degrees. Formal procedures have
then been developed for converting linearly indexed SACs and linearly indexed dual SACs into localized
algorithms. The conversion of linearly indexed SACs to localized algorithms involves eliminating
global dependences by propagating variables in a systematic manner in the index space. The
conversion of dual SACs to localized algorithms is achieved by distributing computations and
introducing an ordering among the computations. The two conversion procedures turn out to be
duals of each other. We should mention here that starting with linearly indexed ACs is by no
means essential in our approach; if one cannot find an AC easily, then one can try to use other
well-known techniques and start the procedure with a linearly indexed SAC.
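The flavor of both conversions can be seen on matrix multiplication. The following Python sketch is a commonly used localized form, given here only as an illustration and not as the formal procedure of [37]: the broadcast of each entry of a and b is replaced by propagation between neighbouring index points, and the unordered sum is replaced by an accumulation ordered along k. Indices are 0-based and the function name is hypothetical.

```python
def matmul_localized(a, b):
    """Localized matrix product: each value depends only on neighbouring index points."""
    n = len(a)
    A, B, C = {}, {}, {}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # a[i][k] enters at j = 0 and is then passed along the j direction.
                A[i, j, k] = a[i][k] if j == 0 else A[i, j - 1, k]
                # b[k][j] enters at i = 0 and is then passed along the i direction.
                B[i, j, k] = b[k][j] if i == 0 else B[i - 1, j, k]
                # Partial sums are accumulated along the k direction.
                previous = 0 if k == 0 else C[i, j, k - 1]
                C[i, j, k] = previous + A[i, j, k] * B[i, j, k]
    return [[C[i, j, n - 1] for j in range(n)] for i in range(n)]

print(matmul_localized([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Every variable now has constant in- and out-degree, with displacement vectors that do not depend on the index point, which is the property a localized algorithm requires.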
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In  summary,  we  have  developed  a  hierarchical  procedure  for  going  from  a  higher  level  represen¬ 
tation  of  an  algorithm  to  a  localized  algorithm,  which  can  be  described  by  an  RIA  or  a  localized 
algorithm.  It  can  be  described  as  follows: 

Mathematical Description → Linearly Indexed Assignment Codes → Linearly Indexed Single Assignment
Codes and Dual Single Assignment Codes → Localized Algorithms.
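
To make the last step of this hierarchy concrete, here is a minimal Python sketch of a localized
form of matrix multiplication. It is the standard textbook localization, offered as an assumption
about what the procedure would produce rather than as output of the report's method: the operands
are propagated between neighboring index points, and the summation is ordered along the k axis.

```python
def localized_matmul(A, B):
    """Matrix product in which each value depends only on a neighboring index point."""
    n = len(A)
    a, b, c = {}, {}, {}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Propagate a_ik along the j axis instead of broadcasting it.
                a[i, j, k] = A[i][k] if j == 0 else a[i, j - 1, k]
                # Propagate b_kj along the i axis.
                b[i, j, k] = B[k][j] if i == 0 else b[i - 1, j, k]
                # Order the summation along the k axis, one term per index point.
                previous = 0 if k == 0 else c[i, j, k - 1]
                c[i, j, k] = previous + a[i, j, k] * b[i, j, k]
    # The finished sums sit on the face k = n - 1 of the index space.
    return [[c[i, j, n - 1] for j in range(n)] for i in range(n)]

result = localized_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The propagation of a and b corresponds to the SAC conversion (removing unbounded out-degrees), and
the ordered accumulation of c corresponds to the dual SAC conversion (removing unbounded
in-degrees). Other propagation directions would give different, equally valid localized algorithms.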

The conversion procedure is by no means unique and a number of localized algorithms can be
generated starting from the same AC. To enable an efficient choice we have also developed procedures
to  directly  schedule  and  analyze  linearly  indexed  codes  of  the  form  (4).  For  example,  we  have 
developed  necessary  and  sufficient  conditions  for  determining  whether  a  sequence  of  SACs  of  the 
form  (4)  can  be  scheduled  using  affine  schedules  (see  [37,  12,  33]).  Procedures  to  schedule  linearly 
indexed  codes  that  do  not  admit  affine  schedules  are  also  discussed  in  [37]. 
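
The conditions for codes of the form (4) are given in the cited references and are not reproduced
here. As a simplified stand-in, the sketch below checks the familiar special case of constant
dependence vectors: a linear schedule t(p) = λ·p is admissible when λ·d ≥ 1 for every dependence
vector d. The dependence vectors used (the three unit vectors of a localized matrix
multiplication) and the function names are assumptions made for illustration.

```python
def is_valid_linear_schedule(lam, dependence_vectors):
    """True if every dependence is delayed by at least one time step."""
    return all(
        sum(l * d for l, d in zip(lam, dvec)) >= 1
        for dvec in dependence_vectors
    )

def schedule_length(lam, n):
    """Number of time steps used by t(p) = lam . p over the cube {0..n-1}^len(lam)."""
    # The extreme values of lam . p are reached at corners of the cube.
    return sum(abs(l) * (n - 1) for l in lam) + 1

# Assumed dependence vectors of a localized matrix multiplication.
D = [(0, 1, 0), (1, 0, 0), (0, 0, 1)]

ok = is_valid_linear_schedule((1, 1, 1), D)
bad = is_valid_linear_schedule((1, 1, 0), D)
steps = schedule_length((1, 1, 1), 4)
```

With λ = (1, 1, 1) all three dependencies are respected, whereas λ = (1, 1, 0) would schedule both
ends of the dependence (0, 0, 1) at the same time step.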

3  Summary  of  Our  Work  on  Reconfigurable  Arrays 

As evident from our work on regular iterative arrays (summarized in Section 2), an array of identical
processing elements (PEs) is an indispensable architecture in VLSI and WSI technology and proves
itself very useful in parallel processing applications. Often, however, during the fabrication process
or at run time, some of the processing elements in a large array will inevitably be faulty. Spare
PEs and extra routing hardware are often provided so that a fault-free array can be constructed;
such reconfiguration capability can be used to increase the yield, and to guarantee fault tolerance
in applications where failure is not permissible. Our work in this regard has been concerned with
the design and analysis of such configurable fault-tolerant arrays.

The general model considered by us is shown in Fig. 5 (see, e.g., [20, 23, 30, 39, 40, 41, 42]):
it consists of a set of identical processors embedded in a flexible interconnection structure that is
configured in the form of a rectangular grid. Each grid line in the mesh has a fixed number of data
paths that can be routed along it (i.e., the model has fixed channel width); switches can be placed
at every grid point and at every location where a processor is connected to the grid. Furthermore,
the processors are often divided into a set of non-spare PEs (say an m × n array) and a set of spare
PEs that are distributed in a pre-determined fashion.

Given a set of faulty PEs, the objective is to reconfigure the connections among the PEs such
that a new rectangular logical array is formed comprising only the healthy PEs and demanding
no more hardware resources (e.g., spare PEs, tracks, and switches) than are available. Clearly,
the more additional hardware is provided, the higher the reconfiguration probability. Nevertheless,
space and cost limitations might make it impossible to add as much hardware as one would want.
Such considerations lead to the following design question: What is the optimal amount of
hardware resources, i.e., the number of spare PEs, the channel width, and the distribution of routing
switches, such that the resulting architecture has high reconfigurability? A related important
question addresses the ease of reconfiguration: Given a configurable architecture with fixed resources,
are there efficient and simple algorithms for reconfiguring such architectures with high probability?

In  our  recent  work  we  have  provided  partial  answers  to  both  the  above  issues.  However,  to 
further  motivate  our  results  and  contributions,  and  to  introduce  some  of  the  relevant  concepts  we 
shall  first  briefly  review  the  previous  work  in  this  area. 

3.1  Background 

In the context of reconfiguration of processor arrays, several researchers [8, 18] have addressed
important analytical issues of the following nature. Given an n × n array of PEs (or cells), each
of which can be faulty with probability p: (1) Can one devise 'good' algorithms that will connect
the non-faulty cells in the form of a grid with high probability? (2) With high probability, what
are the required channel width (defined as the maximum number of interconnection paths routed
along any grid line) and the length of the longest interconnecting wire?

As mentioned before, instead of asymptotic analysis, our studies have been motivated by a more
practical question: how to reconfigure arrays with only a small amount of extra routing hardware. A
general methodology for reconfiguring arrays with faulty PEs is to determine the so-called
compensation paths. A compensation path comprises a sequence of substitutions that logically replaces
a faulty PE by a spare one, and can be described as follows. Let a non-spare PE at location (x, y)
be faulty; then in any valid reconfiguration it has to be replaced by a healthy processor. Let the
faulty PE at (x, y) be logically replaced by a healthy PE, say at location (x', y'); logical
replacement implies that in the reconfigured array the physical PE at location (x', y') will be
reindexed as (x, y). The PE at (x', y') is in turn replaced by a healthy PE, say at location
(x'', y''); one can continue this chain until one ends up at a spare PE. A compensation path can now
be defined as the ordered sequence of nodes (x, y), (x', y'), (x'', y''), ... involved in the
replacement chain. Thus, the compensation paths determine the neighbors of each PE in the logical
(reconfigured) array. Hence, once the compensation paths are determined, the reconfiguration
procedure is completed by connecting each PE to its logical neighbors.
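
The reindexing implied by a compensation path can be sketched in a few lines of Python. The
function names and the coordinate convention (spare PEs addressed by coordinates just outside the
m × n array) are assumptions made for this illustration.

```python
def logical_assignment(path):
    """Map each physical PE on a compensation path to the logical index it takes over.

    path[0] is the faulty PE and path[-1] is the spare PE that ends the chain.
    """
    if len(path) < 2:
        raise ValueError("a compensation path needs a faulty PE and a spare PE")
    # The physical PE at path[t + 1] is reindexed as path[t].
    return {path[t + 1]: path[t] for t in range(len(path) - 1)}

def reindex(m, n, paths):
    """Physical-to-logical index map of an m x n array for the given paths."""
    mapping = {(x, y): (x, y) for x in range(m) for y in range(n)}
    for path in paths:
        del mapping[path[0]]  # the faulty PE takes no part in the logical array
        mapping.update(logical_assignment(path))
    return mapping

# A 2 x 3 array with a spare column at y = 3 and a single fault at (0, 1).
mapping = reindex(2, 3, [[(0, 1), (0, 2), (0, 3)]])
```

Here the PE at (0, 2) becomes logical (0, 1) and the spare at (0, 3) becomes logical (0, 2); the
logical neighbors of each PE then follow from the logical indices.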

It is easy to see that if the number of faulty PEs is less than the number of spare PEs, then one
can always define a set of compensation paths for successful reconfiguration. However, the
characteristics of the compensation paths (e.g., the geometrical distances between consecutive nodes, or
the relative positions of the nodes in the grid) determine the amount of routing hardware needed
to implement the necessary connections among the logical neighbors. It can be easily shown that if
the number of routing tracks is fixed, then one cannot allow arbitrary sets of compensation paths.
In other words, by limiting the hardware resources one limits the number of fault patterns that
one can reconfigure. Hence, a natural question to ask is how many tracks one should provide so as
to allow a large enough class of compensation paths and yet keep the hardware redundancy low.

A model with very limited hardware resources has been studied in [16, 17, 38]. It consists of an
m × n array of non-spare PEs, one row (or column) of spare PEs along each boundary, a single track
along every grid line (i.e., channel width = 1), and single-track switches located at intersections
where processors are connected to the grid. It is further assumed that a faulty PE can be converted
into a connecting element, thereby making an implicit assumption that there is an extra channel
within every PE (because of this assumption, such models have also been referred to as 1½-track
models). The advantage of the single-track switch model arises from its inherent simplicity: since
data paths take up a significant amount of area on a wafer or chip, a considerable saving in area is
achieved by allowing only one data path along every grid line; moreover, the simplicity of the
switches makes the routing hardware more reliable. Furthermore, extensive simulations reported in
[17] show that a considerable enhancement in yield can be achieved by reconfiguring array grid
models with single-track switches.

We now briefly discuss the results reported in [17]. The paper derives a set of sufficient conditions
for determining whether an array with a particular distribution of faulty processors is
reconfigurable, where a given array is reconfigurable if the non-faulty processors can be connected
to form an m × n array. The sufficient conditions restrict the compensation paths to be straight.
Fig. 7 shows a compensation path and the corresponding routing required for replacing a single
faulty processor in the single-track model; note that the compensation path is straight and
continuous. This simple concept of using straight and continuous compensation paths can also be
used in the presence of multiple faulty processors, and the sufficient conditions can be formally
presented in the form of the so-called reconfigurability theorem (for a formal proof, see [16, 17]):

Reconfigurability Theorem: Given an m × n array of non-spare PEs, with spare PEs along
the sides, it is reconfigurable into an m × n array of healthy processors by single-track switches if (1)
there exists a set of continuous and straight compensation paths covering all the faulty non-spare
PEs and (2) there is neither an intersection nor a near-miss among the compensation paths.

A  near-miss  situation  occurs  if  two  compensation  paths  in  neighboring  rows  (columns)  overlap 
and  are  in  opposite  directions  (see  Fig.  8;  note  that  a  near-miss  situation  does  not  occur  if  the 
compensation  paths  overlap  by  only  one  node). 
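
The near-miss condition for horizontal paths can be checked as in the Python sketch below. The
encoding is an assumption for illustration only: a path is a (row, column, direction) triple, the
array has columns 0 to n - 1, and the spare columns are taken to sit at columns -1 and n. Vertical
paths can be handled in the same way by exchanging the roles of rows and columns.

```python
def covered_columns(col, direction, n):
    """Columns visited by a straight horizontal path that starts at `col`."""
    # Spare columns are assumed to sit at column -1 (left) and column n (right).
    if direction == "right":
        return set(range(col, n + 1))
    return set(range(-1, col + 1))

def is_near_miss(p, q, n):
    """p and q are (row, col, direction) triples describing horizontal paths."""
    (r1, c1, d1), (r2, c2, d2) = p, q
    # Only paths in neighboring rows with opposite directions can near-miss.
    if abs(r1 - r2) != 1 or d1 == d2:
        return False
    overlap = covered_columns(c1, d1, n) & covered_columns(c2, d2, n)
    # An overlap of a single node does not count as a near-miss.
    return len(overlap) > 1

near = is_near_miss((0, 1, "right"), (1, 3, "left"), 5)
touch = is_near_miss((0, 2, "right"), (1, 2, "left"), 5)
apart = is_near_miss((0, 3, "right"), (1, 1, "left"), 5)
```

In the first case the two paths overlap in columns 1 to 3 and form a near-miss; in the second they
share only column 2; in the third they do not overlap at all.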

In [17] an algorithm to determine valid reconfigurations that satisfy the conditions in the
reconfigurability theorem is also presented. The algorithm is developed by reformulating the
reconfigurability problem as a maximum independent set problem, and then adapting a well-known
algorithm for determining maximum independent sets in a graph. However, the maximum independent set
problem is NP-complete and the best known algorithms take exponential time; hence, the
algorithm presented in [17] has exponential complexity. The question of whether efficient
polynomial time algorithms exist was left open. Moreover, efficient algorithms were not known even for
the restricted cases where spare processors are available only along, for example, two or three sides
(as opposed to on all four sides as shown in Fig. 6).

3.2  New  Contributions 

In  view  of  the  above  results,  the  contributions  of  our  work  with  regard  to  the  single-track  models 
can  be  summarized  as  follows  (see  [38]): 

•  We showed that the conditions in the reconfigurability theorem are not necessary, correcting
a claim made in [11].

•  We developed a polynomial time algorithm (in fact, the complexity is O(|F|^2), where |F|
is the number of faulty processors) for determining valid reconfigurations according to the
sufficient conditions. Moreover, linear time algorithms for determining valid reconfigurations
were developed for the restricted cases where the spare processors are not present along all four
sides of the array.

We  should  note  here  that  the  combinatorial  problem  underlying  the  reconfigurability  issues  in  the 
single-track  model  is  a  very  interesting  geometrical  problem  on  rectangular  grids,  and  can  be  stated 
as  follows: 

Problem 1 Let V be the set of grid points in an m × n 2-dimensional grid, and let F ⊂ V.
Determine a set of straight lines such that

1.  Each vertex v ∈ F is assigned a straight line connecting it to one of the four boundaries of
the  grid. 

2.  The  straight  lines  are  non-intersecting. 

The  algorithm  developed  by  us  appears  to  be  the  first-known  polynomial  time  algorithm  for  this 
problem. 
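
To make Problem 1 concrete, the following Python sketch solves small instances by exhaustive
search over the four directions of every vertex in F. This brute-force search takes time
exponential in |F| and is emphatically not the polynomial time algorithm referred to above; the
names and the grid encoding are assumptions for illustration.

```python
from itertools import product

DIRECTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def line_points(v, direction, m, n):
    """Grid points on the straight line from v to the boundary in `direction`."""
    dr, dc = DIRECTIONS[direction]
    r, c = v
    points = set()
    while 0 <= r < m and 0 <= c < n:
        points.add((r, c))
        r, c = r + dr, c + dc
    return points

def assign_lines(F, m, n):
    """Return {vertex: direction} with pairwise non-intersecting lines, or None."""
    F = sorted(F)
    for choice in product(DIRECTIONS, repeat=len(F)):
        used = set()
        for v, d in zip(F, choice):
            pts = line_points(v, d, m, n)
            if pts & used:
                break  # this line meets an earlier one; try the next choice
            used |= pts
        else:
            return dict(zip(F, choice))
    return None

easy = assign_lines({(0, 0), (1, 1)}, 3, 3)
# The center of a 3 x 3 grid is blocked when all four of its neighbors are in F.
blocked = assign_lines({(1, 1), (0, 1), (2, 1), (1, 0), (1, 2)}, 3, 3)
```

Two axis-parallel lines on the grid intersect exactly when they share a grid point, which is why a
set intersection suffices as the test.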

The  algorithms  discussed  so  far  focus  on  the  satisfiability  question,  i.e.,  whether  all  the  faulty 
PEs  can  be  replaced  by  straight  and  non-intersecting  compensation  paths.  Often,  however,  a 
more  relevant  issue  might  be  to  determine  the  maximum  number  of  faulty  processors  that  can  be 
replaced. In [1] a polynomial time algorithm of time complexity O(|F|^3) was developed for solving
the corresponding combinatorial problem.

The motivation behind some of the results to be described next is to introduce a minimal amount
of additional hardware that will allow a much larger class of compensation paths than the restricted
class of straight paths. One can easily observe that the sufficient conditions for reconfiguration
discussed above are also valid for more general array models, such as those with multiple tracks.
One hopes, however, that for such more powerful models it should be possible to develop more
general conditions that allow reconfiguration of arrays that could not be reconfigured in
the single-track model. With this motivation in mind, we first considered an augmented
single-track switch model, as shown in Fig. 9. We showed that the augmented model is more powerful
than the simple single-track model: the compensation paths in the augmented model need no longer
be straight and can have bends. In general, if one allows multiple data paths along
every grid line (i.e., a multiple-track model), then the compensation paths can be crooked and
the restriction on near-misses is no longer required. Hence, a generalized sufficient condition can
be stated as follows: an array grid model with multiple-track switches is reconfigurable if one can
determine a set of non-intersecting compensation paths (continuous, but not necessarily straight)
for the faulty PEs in the array. We also showed that the combinatorial problem corresponding to
such a sufficient condition can be efficiently solved by reducing it to the well-known problem of
determining maximum flow in networks.
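
The details of the reduction are not given here, so the Python sketch below shows only one
plausible construction, stated under explicit assumptions: compensation paths move between
4-neighbors, every PE may lie on at most one path (enforced by splitting each PE into an "in" and
an "out" node joined by a unit-capacity edge), a faulty PE can only start a path, and all spare PEs
are healthy. All names are hypothetical. The faulty PEs can be covered by non-intersecting paths
exactly when the maximum flow equals the number of faulty PEs.

```python
from collections import defaultdict, deque

def max_disjoint_compensation_paths(m, n, faulty, spares):
    """Maximum number of node-disjoint paths from faulty PEs to spare PEs."""
    faulty, spares = set(faulty), set(spares)
    cap = defaultdict(lambda: defaultdict(int))

    def add_edge(u, v):
        cap[u][v] += 1
        cap[v][u] += 0  # make sure the reverse residual edge exists

    cells = {(r, c) for r in range(m) for c in range(n)} | spares
    for v in cells:
        add_edge((v, "in"), (v, "out"))  # unit node capacity
        r, c = v
        for w in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if w in cells and w not in faulty:  # a faulty PE cannot replace another
                add_edge((v, "out"), (w, "in"))
    for f in faulty:
        add_edge("S", (f, "in"))
    for s in spares:
        add_edge((s, "out"), "T")

    flow = 0
    while True:
        # Breadth-first search for an augmenting path (Edmonds-Karp).
        parent = {"S": None}
        queue = deque(["S"])
        while queue and "T" not in parent:
            u = queue.popleft()
            for v, residual in cap[u].items():
                if residual > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if "T" not in parent:
            return flow
        v = "T"
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1

# 2 x 2 array with a spare column at column index 2.
spare_column = [(0, 2), (1, 2)]
two = max_disjoint_compensation_paths(2, 2, [(0, 0), (0, 1)], spare_column)
```

In the example, the fault at (0, 0) cannot use a straight path to the right because (0, 1) is also
faulty, but the bent path through (1, 0) and (1, 1) to the spare at (1, 2) is allowed, so both
faults are covered.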

In more recent work, we have made progress in developing new reconfiguration algorithms
(see [44]) for a different set of models that were first introduced in [39] (see also [3, 20, 46]). The
appeal of this set of models is its hardware simplicity. It consists of an N × (N + 1) array, where
N spare PEs are configured in the form of a spare column and are located along one boundary of
the original N × N array. The goal of the reconfiguration algorithm is to obtain an N × N array
of healthy processors given that any N of the PEs are faulty; moreover, the reconfigured array
must satisfy certain neighborhood constraints. Let the physical array be the array given to us, and
let the logical array be the reconfigured array. Then the neighborhood constraint requires that in
the logical array the neighbors of a healthy processor be restricted to lie in a fixed neighborhood
around the healthy processor.

The reconfiguration algorithms presented in [39] for the above model are very fragile, and fail to
reconfigure even when the faulty PE distributions are very simple; the following are only two simple
instances (one can generate several others): (1) the bottom row or the top row does
not have any faulty processors; (2) there are two columns, each with a stack of faulty processors,
even of size two.

The algorithms that we have developed can reconfigure all instances of arrays that the algorithms
reported in [39] can. Moreover, we can also reconfigure arrays with several other fault distributions,
without relaxing the neighborhood constraints. In particular, we give reconfiguration algorithms for
the following cases: (1) every column has one faulty PE; (2) one column has two faulty PEs and
the rest of the columns have one faulty PE each or no faulty PEs; (3) there are two stacks of
faulty PEs, each of length N/2; and (4) the faulty PEs can be partitioned into blocks such that
each block can be considered as one of the above special cases.

The reconfiguration algorithms are based on one simple algorithm that shows how to reconfigure
an N × (N + 1) array into an (N + 1) × N array and vice versa. Moreover, our algorithms generate
very regular patterns, and can be implemented in a distributed fashion instead of being processed
by one host processor. We should also mention here that a model similar to ours is considered in [3];
however, the strict neighborhood restrictions are not imposed there. We can show that the algorithms
in [3] require larger neighborhoods for arrays that our algorithms can reconfigure with smaller
neighborhoods; the details will be provided in the full paper.

We are continuing our studies on developing more powerful models that have high
reconfiguration probability and yet do not require too much extra hardware.

Figure 5: A general model for a reconfigurable processor array; the circles show possible locations
of switches in the array.

Figure 6: The array grid model based on single-track switches, shown with the possible states of a
switch.

Figure 7: A compensation path and the corresponding routing required for replacing a single faulty
processor in the single-track model.

Figure 8: Near-miss situations among compensation paths: (a) shows a near-miss situation;
(b) shows a situation that does not correspond to a near-miss.

Figure 9: An augmented single-track model.